Snapshot Judgements: Obtaining Data Insights without Tracing

Authors

  • Wenxuan Wang
  • Ian F. Adams
  • Avani Wildani
Abstract

Metadata snapshots are a favored method for gaining filesystem insights because they are small and relatively easy to acquire compared to access traces [2]. Because snapshots do not include an access history, they are typically used for relatively simple analyses such as file lifetime and size distributions, and researchers still gather and store full block or file access traces for any higher-level analysis such as cache prediction or scheduling variable replication [1, 3]. We claim that one can gain rich insights into file system and user behavior by clustering metadata snapshots and comparing the entropy within clusters to the entropy within natural partitions such as directory hierarchies or single attributes. We have preliminary results indicating that agglomerative clustering methods produce groups of data with high information purity, which may be a sign of functional correlation.

While many studies have analyzed metadata snapshots, most focus on simple statistics, such as file size, age, or extension, or they attempt to reconstruct dynamic trace information from a series of snapshots by interpolating inter-snapshot accesses. We focus instead on what can be learned about a system by looking at metadata correlations within a small set of widely spaced snapshots. For example, timestamps can give insight into the dynamic activity of the system from a purely static viewpoint. UIDs can be used in conjunction with file paths to determine whether there is a "typical" namespace structure that users create. Entropy between members of a namespace can help us relate different segments of a trace [4].

Full I/O traces are always superior, but keeping complete logs of accesses is prohibitive in many systems because of the computational overhead of collecting the logs and the storage overhead of keeping them. For a modern storage system with hundreds of thousands of I/Os per second, storing even minimal representations of the I/O without any metadata is very costly. For example, an enterprise storage system may create over 16 GB of block-level I/O logs per day [5]. Moreover, storing complete traces with metadata is even harder than storing raw accesses because there is more overhead in terms of both size and performance, so this information is usually lost. We examined a series of clusterings using HPC and ...

Figure 1: Sample clustering view for a single snapshot. Clusters are indicated by shape and modification time is indicated by color.
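The comparison described in the abstract, entropy within clusters versus entropy within a natural partition such as directories, can be illustrated with a small sketch. This is not the authors' implementation: the records, the feature choice (log2 of file size and modification age in days), the single-linkage merge rule, and all function names are assumptions made for illustration. The sketch partitions a toy snapshot two ways and reports the size-weighted Shannon entropy of the file extension within each partition; lower entropy means purer groups.

```python
import math
import posixpath
from collections import Counter, defaultdict

def shannon_entropy(labels):
    """Shannon entropy, in bits, of a list of categorical labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def weighted_partition_entropy(group_ids, labels):
    """Size-weighted mean entropy of `labels` within each group."""
    groups = defaultdict(list)
    for gid, label in zip(group_ids, labels):
        groups[gid].append(label)
    total = len(labels)
    return sum(len(g) / total * shannon_entropy(g) for g in groups.values())

def single_linkage_clusters(points, n_clusters):
    """Naive agglomerative clustering (single linkage, an assumed choice).

    Repeatedly merges the two clusters whose closest members are nearest,
    until `n_clusters` remain. Returns one cluster id per input point.
    """
    clusters = [[i] for i in range(len(points))]

    def gap(a, b):
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        _, a, b = min(
            (gap(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a].extend(clusters[b])
        del clusters[b]

    ids = [0] * len(points)
    for cid, members in enumerate(clusters):
        for i in members:
            ids[i] = cid
    return ids

# Hypothetical snapshot records: (path, size in bytes, modification age in days).
snapshot = [
    ("/proj/a/x.c", 2048, 100),
    ("/proj/a/y.c", 4096, 101),
    ("/proj/a/run.log", 1048576, 1),
    ("/proj/a/err.log", 2097152, 2),
    ("/proj/b/z.c", 1024, 99),
    ("/proj/b/w.c", 2048, 102),
    ("/proj/b/out.log", 4194304, 3),
    ("/proj/b/dbg.log", 1048576, 1),
]

extensions = [posixpath.splitext(path)[1] for path, _, _ in snapshot]
directories = [posixpath.dirname(path) for path, _, _ in snapshot]
features = [(math.log2(size), float(age)) for _, size, age in snapshot]
cluster_ids = single_linkage_clusters(features, n_clusters=2)

dir_entropy = weighted_partition_entropy(directories, extensions)
cluster_entropy = weighted_partition_entropy(cluster_ids, extensions)
print(f"entropy within directories: {dir_entropy:.3f} bits")
print(f"entropy within clusters:    {cluster_entropy:.3f} bits")
```

In this toy data each directory mixes source and log files, while the metadata clusters separate them, so the clusters come out purer than the directory partition. Real snapshots would need feature scaling and a more efficient clustering routine than this quadratic-time sketch.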

Similar resources

Annual performance assessment of services for children and young people in the London Borough of Lambeth 2008

This letter summarises the findings of the 2008 annual performance assessment (APA) for your council. The evaluations and judgements in the letter draw on a range of data and information which covers the period 1 April 2007 to 31 March 2008. As you know, the APA is not based on an inspection of your services and, therefore, can only provide a snapshot based on the evidence considered. As such, ...

Annual performance assessment of services for children and young people in West Sussex County

This letter summarises the findings of the 2008 annual performance assessment (APA) for your council. The evaluations and judgements in the letter draw on a range of data and information which covers the period 1 April 2007 to 31 March 2008. As you know, the APA is not based on an inspection of your services and, therefore, can only provide a snapshot based on the evidence considered. As such, ...

Point-Versus Interval-Based Temporal Data Models

The association of timestamps with various data items such as tuples or attribute values is fundamental to the management of time-varying information. Using intervals in timestamps, as do most data models, leaves a data model with a variety of choices for giving a meaning to timestamps. Specifically, some such data models claim to be point-based while other data models claim to be interval-base...

Ancestry.com Online Forum Test Collection

This report outlines the construction of the Ancestry.com Forum document collection and information retrieval test collection. The Ancestry.com Forum Dataset was created with the cooperation of Ancestry.com in an effort to promote research on information retrieval, language technologies, and social network analysis. It contains a full snapshot of the Ancestry.com online forum, boards.ancestry.c...

Link Reassignment based Snapshot Partition for Polar-orbit LEO Satellite Networks

Abstract—Snapshot is a fundamental notion proposed for routing in mobile low earth orbit (LEO) satellite networks which is characterized with predictable topology dynamics. Its distribution has a great impact on the routing performance and on-board storage. Originally, the snapshot distribution is invariable by using the static snapshot partition method based on the mechanical steering antenna....

Publication date: 2017